ABSTRACT


The recent SARS epidemic has boosted interest
in the discovery of novel human and animal
coronaviruses. By July 2007, more than 3000 coronavirus
sequence records, including 264 complete genomes,
were available in GenBank. The number of
coronavirus species with complete genomes
available has increased from 9 in 2003 to 25 in 2007,
of which six, including coronavirus HKU1, bat
SARS coronavirus, group 1 bat coronavirus HKU2
and group 2c and 2d coronaviruses, were sequenced
by our laboratory. To overcome the problems we
encountered in the existing databases during
comparative sequence analysis, we built a comprehensive
database, CoVDB (http://covdb.microbiology.hku.hk),
of annotated coronavirus genes and genomes.
CoVDB provides a convenient platform for
rapid and accurate batch sequence retrieval, the
cornerstone and bottleneck for comparative gene
or genome analysis. Sequences can be directly
downloaded from the website in FASTA format.
CoVDB also provides detailed annotation of all
coronavirus sequences using a standardized
nomenclature system, and overcomes the problems
of duplicated and identical sequences in other
databases. For complete genomes, a single
representative sequence for each species is available for
comparative analysis such as phylogenetic studies.
With the annotated sequences in CoVDB, more
specific blast search results can be generated for
efficient downstream analysis.


INTRODUCTION 


Coronaviruses are found in a wide variety of animals and
are associated with respiratory, enteric, hepatic and
neurological diseases of varying severity. Based on
genotypic and serological characterization, coronaviruses
were divided into three distinct groups (1-3). As a result of
the unique mechanism of viral replication, coronaviruses
have a high frequency of recombination (2,4).

The recent severe acute respiratory syndrome (SARS)
epidemic, the discovery of SARS coronavirus (SARS-CoV)
and the identification of SARS-CoV-like viruses from
Himalayan palm civets and a raccoon dog in wild live
markets in China have boosted interest in the
discovery of novel coronaviruses in both humans and
animals (5-9) (Figure 1). For human coronaviruses, a
novel group 1 human coronavirus, human coronavirus
NL63 (HCoV-NL63), was reported in 2004 (10,11), while
we described the discovery, complete genome sequence and
genetic diversity of a novel group 2 human coronavirus,
coronavirus HKU1 (CoV-HKU1), in 2005 (4,12-14).
As for animal coronaviruses, six group 1 (15-17), four
group 2, including bat SARS-CoV and two new subgroups
of group 2 coronaviruses (6,8,18,19), and 11 group 3
(20-23) coronaviruses have recently been described.

By July 2007, more than 3000 coronavirus sequence
records, including a total of 264 complete genomes, were
available in GenBank (24). Among the 25 coronavirus
species with complete genome sequences available, six were
sequenced by our group, including CoV-HKU1 and bat
SARS-CoV (13,16,18,19). Furthermore, we defined two
novel subgroups of group 2 coronavirus (18). During
batch sequence retrieval for comparative
genome analysis of the coronavirus genomes that we
sequenced, we encountered several major problems with
the coronavirus sequences in GenBank as well as in other
coronavirus databases (Coronaviridae Bioinformatics
Resource, http://athena.bioc.uvic.ca/database.php?db=coronaviridae;
PATRIC, http://patric.vbi.vt.edu) (25).
First, in GenBank, the non-structural proteins in the
polyprotein encoded by orf1ab were not annotated.
Second, in all databases, the annotations of the
non-structural proteins encoded by ORFs downstream to
orf1ab are often confusing because they do not follow
a standardized system. Third, multiple accession numbers
are often present for reference sequences (26). These
problems often lead to confusion when sequence retrieval
is performed. Fourth, coronaviruses, especially SARS-CoV,
amplified from different specimens may contain
the same genome or gene sequences. These sequences
usually lead to redundant work when they are analyzed.

Figure 1. Number of coronavirus sequences in GenBank from 1984 to 2006.

In view of these problems, we started to develop our 
own database for coronavirus gene and genome sequences 
in 2005. In this database, CoVDB, we sought to create 
a user-friendly platform for efficient batch sequence 
retrieval, which is crucial for comparative genome 
analysis. In this article, we describe this comprehensive 
database of annotated coronavirus genes and genomes, 
which provides a central source of information about 
coronaviruses. To further increase the usefulness of 
CoVDB, commonly used bioinformatics tools were also 
included for analysis of the sequence data. 


MATERIALS AND METHODS 
Database description 


Sequence data. CoVDB is a web-based coronavirus
database. Its data are stored and managed in a
MySQL database management system. As of July 2007,
CoVDB contains 3982 coronavirus sequences and one
torovirus genome sequence. Two hundred and sixty-four
of them are complete genomes and the rest are partial
genomes or genes. All data were retrieved from GenBank
using BioPerl modules. We annotated sequences
that lacked gene information or non-structural protein
boundaries and labeled the 5’ and 3’ untranslated regions
(UTRs) of the genomes. As of July 2007, CoVDB contains
12344 genes and UTRs.
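
As a rough illustration of how annotated genes and UTRs might be laid out relationally, the sketch below uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server. The table names, column names and the demo rows are all hypothetical; the actual CoVDB schema is not described here.

```python
import sqlite3

# Hypothetical layout for illustration only. CoVDB's real MySQL schema is
# not described in the text; sqlite3 stands in for MySQL so the sketch runs
# without a database server.
SCHEMA = """
CREATE TABLE genome (
    accession   TEXT PRIMARY KEY,  -- GenBank accession number
    short_name  TEXT,
    is_complete INTEGER            -- 1 = complete genome, 0 = partial
);
CREATE TABLE feature (
    id        INTEGER PRIMARY KEY,
    accession TEXT REFERENCES genome(accession),
    name      TEXT,                -- gene, nsp or UTR label
    seq_start INTEGER,             -- 1-based, inclusive
    seq_end   INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database with one made-up genome and two features."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO genome VALUES (?, ?, ?)", ("DEMO_0001", "demo-CoV", 1)
    )
    conn.executemany(
        "INSERT INTO feature (accession, name, seq_start, seq_end) "
        "VALUES (?, ?, ?, ?)",
        [("DEMO_0001", "S", 21000, 24800), ("DEMO_0001", "5'UTR", 1, 250)],
    )
    conn.commit()
    return conn


def features_of(conn, accession):
    """Return (name, start, end) tuples for one genome, ordered by position."""
    cur = conn.execute(
        "SELECT name, seq_start, seq_end FROM feature "
        "WHERE accession = ? ORDER BY seq_start",
        (accession,),
    )
    return cur.fetchall()
```

Keeping every gene, nsp and UTR as a row keyed by accession is one simple way to make batch retrieval a single query; other layouts are equally plausible.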


Information on coronavirus genome characteristics. In 
addition to the two sequence retrieval pages, CoVDB 
collects information on coronavirus sequence characteristics,
including genome organization, a brief description
of each complete coronavirus genome, GC content,
polyprotein cleavage sites, transcription regulatory 
sequences, acidic tandem repeat sequences and known 
RNA structures. These pieces of information can be 
accessed by clicking ‘Genome’ in the top menu bar of 
CoVDB. In the ‘Tools’ page, blast similarity search (27) 
against annotated coronavirus sequences in CoVDB can 
be performed and other commonly used tools are also 
provided. 
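
For readers who want to reproduce the simplest of these characteristics, the following is a minimal sketch of GC content and GC skew. It assumes the usual definitions, (G+C)/(A+C+G+T) and (G-C)/(G+C); the text does not state the exact formulas used by CoVDB, and the function names are illustrative.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T/U)."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    total = g + c + seq.count("A") + seq.count("T") + seq.count("U")
    return (g + c) / total if total else 0.0


def gc_skew(seq):
    """(G - C) / (G + C); assumed definition, not confirmed by the source."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0
```

Ambiguous bases such as N are ignored in the denominator, which is a design choice of this sketch rather than a documented CoVDB behavior.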
Functionality of the database


Batch sequence retrieval. The main goal of setting
up CoVDB is to provide a convenient and efficient
platform for retrieving batches of coronavirus gene
sequences. The interfaces of the database are simple and
user friendly. All genes and genomes contain links to
GenBank and/or PubMed. CoVDB contains two main
pages for sequence retrieval. From the homepage, one
can enter the first main page, for retrieval of complete
genomes and their genes, by clicking ‘CoVDB’ (Figure 2a).
From this page, users can obtain genes from specific
coronavirus species by selecting the corresponding
check boxes. We defined one representative genome
from each species as the ‘Type strain’. Most of the time,
this ‘Type strain’ is the one assigned as the reference
sequence in GenBank. By choosing the ‘Type strain only’
option, users can obtain one gene sequence per species
and construct a phylogenetic tree or perform other
comparisons. An example of retrieving the complete genomes,
or a specific gene from the complete genomes, of selected
species is shown in Figure 2b and c.
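
The ‘Type strain only’ retrieval described above can be pictured as a filter followed by FASTA export. The sketch below is an offline illustration under stated assumptions: the `GeneRecord` fields, the function names and the toy sequences are hypothetical and do not reflect the CoVDB web interface or its internal code.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    accession: str
    species: str
    gene: str
    is_type_strain: bool
    sequence: str


def select_genes(records, gene, species, type_strain_only=True):
    """Keep records of one gene from the chosen species.

    With type_strain_only=True, at most the representative record of each
    species survives, which is what a phylogenetic comparison needs.
    """
    wanted = set(species)
    return [
        r for r in records
        if r.gene == gene
        and r.species in wanted
        and (r.is_type_strain or not type_strain_only)
    ]


def to_fasta(records, width=60):
    """Format records as FASTA text with fixed-width sequence lines."""
    lines = []
    for r in records:
        lines.append(f">{r.accession} {r.species} {r.gene}")
        lines.extend(
            r.sequence[i:i + width] for i in range(0, len(r.sequence), width)
        )
    return "".join(line + "\n" for line in lines)
```

For example, `to_fasta(select_genes(records, "S", ["sp1", "sp2"]))` would yield one S gene entry per selected species, ready for alignment.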

From the page for retrieval of complete genomes and
their genes, one can enter the second main page, for
retrieval of all complete and/or incomplete genes of a
coronavirus (Figure 3a), by clicking ‘From all groups of
genes’. On this page, all the gene sequences are grouped
vertically according to the coronavirus group and
subgroup they belong to, and horizontally by the names
of the genes. The option ‘Exclude partial CDS’ can be
used if only complete genes are required. An example of
retrieving all the sequences of a particular gene for a group
of coronaviruses is shown in Figure 3b. If the translated
sequence of a selected gene has more than one stop
codon, which is probably due to sequencing error, the
number in the ‘Length’ column of this gene will be marked
in red.
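
The stop-codon check behind the red ‘Length’ marking can be sketched as follows. This is an assumption-laden illustration: it reads the coding sequence in frame from its first base and uses the three standard stop codons, and the function names are invented here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_inframe_stops(cds):
    """Count stop codons when the sequence is read in frame from base 1."""
    cds = cds.upper().replace("U", "T")
    return sum(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 2, 3))


def needs_red_flag(cds):
    """True if translation would contain more than one stop codon."""
    return count_inframe_stops(cds) > 1
```

A complete CDS normally ends with exactly one stop codon, so a count above one is what signals a possible sequencing error.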


Polyprotein annotation. In all coronavirus genomes,
orf1ab occupies two-thirds of the genome and is
translated as a polyprotein. This polyprotein is
post-translationally cleaved by 3C-like protease (3CLpro) and
papain-like protease (PLpro) into 15-16 non-structural
proteins. Some of the non-structural proteins, such as
RNA-dependent RNA polymerase, helicase, 3CLpro and
PLpro, are essential for replication or virulence of the
coronavirus, although the functions of others are still
unclear. Because these non-structural proteins are
essential, their sequences are often used for evolutionary
analysis, primer design, etc. However, except for the
reference sequences, detailed cleavage site information is
not provided for the non-structural proteins in other
sequences in GenBank. Since it has been shown that
3CLpro and PLpro of coronavirus cleave at conserved
specific amino acids, the putative cleavage sites of
the 15-16 non-structural proteins can be predicted by
multiple sequence alignment. Using these pieces of
information, we have annotated these non-structural
proteins in all the coronavirus sequences for easy retrieval
in CoVDB.
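
One way to picture the alignment-based prediction is to project known cleavage positions from an annotated reference polyprotein onto a new sequence through a pairwise row of the alignment. The sketch below assumes the two aligned strings are already available (gaps written as `-`) and uses invented toy peptides; it illustrates the idea and is not the annotation pipeline actually used for CoVDB.

```python
def project_cleavage_sites(aligned_ref, aligned_query, ref_sites):
    """Map cleavage sites from a reference onto an aligned query sequence.

    ref_sites holds 1-based positions, in the ungapped reference, of the
    residue immediately before each cleavage site. The result maps each
    reference position to the corresponding 1-based position in the
    ungapped query, or None if the query has a gap in that column.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(ref_sites)
    ref_pos = query_pos = 0
    projected = {}
    for r, q in zip(aligned_ref, aligned_query):
        if q != "-":
            query_pos += 1
        if r != "-":
            ref_pos += 1
            if ref_pos in wanted:
                projected[ref_pos] = query_pos if q != "-" else None
    return projected
```

With the toy pair `"MKTLQSGAVLQAG"` and `"MR-LQSGAILQAG"` and reference sites 5 and 11, the query sites come out one residue earlier because of the single deletion upstream.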
Protein/gene name unification. By convention, all
non-structural proteins in the polyprotein encoded by orf1ab
are named as ‘nsp’, with each protein numbered
consecutively starting from the 5’ end (nsp1 to nsp16).
The structural proteins after the polyprotein are
hemagglutinin esterase (HE, in group 2a coronaviruses), spike
glycoprotein (S), envelope protein (E), membrane protein
(M) and nucleocapsid protein (N). However, there is
no unified naming system for the non-structural proteins
encoded by ORFs downstream to orf1ab. This lack of
a unified system greatly reduces the stability and accuracy
of ortholog retrieval.

Figure 2. Screenshots of CoVDB complete genome retrieval pages. (a) A specific gene can be retrieved using the pull-down list at the lower left corner. The number in brackets indicates the number of complete genomes for that coronavirus. (b) Example showing genomes of selected species (some group 2a coronaviruses and SARS-CoV-related coronaviruses). The default is to show only the ‘Type strain’ for each species. The columns NCBIacc and PMID link to GenBank and PubMed, respectively. (c) Example showing the S gene of selected species, obtained by choosing S in the pull-down list. For genes downstream to orf1ab, sequences upstream to the initiation codons can also be retrieved from this result page. This function is particularly useful for the detection of transcription regulatory sequences.

In CoVDB, with the aim of facilitating gene retrieval,
we tried to unify the naming of these non-structural
proteins from different groups of coronaviruses. On the
other hand, we have also tried to avoid radical changes
in the names that may lead to confusion. In CoVDB,
these non-structural proteins are named as NS2a, NS3x,
NS4x, NS5x and NS7x (x = a, b, c,...). NS2a denotes the
ORF between orf1ab and HE of group 2a coronaviruses.
NS3x denotes the ORFs between S and E of groups 1, 2c,
2d and 3 coronaviruses. In most of these coronaviruses,
there are two NS3x, named NS3a and NS3b. However,
in group 1 coronaviruses, the genomes of some members
(e.g. HCoV-NL63, PEDV) contain only one ORF between
S and E. When we compared their putative amino acid
sequences to the corresponding ones in other group 1
coronavirus genomes using BLAST, and searched
for conserved domains using motifscan, the results showed
that the putative proteins encoded by these ORFs
belonged to a protein family in Pfam originally assigned
as ‘Corona_NS3b’ (accession number PF03053).
Therefore, we named these ORFs as NS3b. NS4x denotes
the ORFs between S and E of group 2a coronaviruses.
NS5x denotes the ORFs between M and N of group 3
coronaviruses. One exception is NS5a of group 2a
coronaviruses. Traditionally, this name denotes an ORF
upstream of E in group 2a coronaviruses. Therefore,
we have kept this name for that ORF in CoVDB. NS7x
denotes the ORFs downstream of the N gene. It is important
to note that due to variations in genome organization
among different groups of coronaviruses (Table 1),
NS genes with the same name in different coronavirus
groups may not be orthologs of each other. The complete
genome gene search page of CoVDB contains a link to a
Gene synonyms page, which includes a list of synonymous
names of the various genes in the coronavirus genomes.
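
The positional naming rules just described can be written down as a small lookup. The sketch below is a simplification under stated assumptions: the function and table names are invented, letters are simply assigned in genome order, and the special cases mentioned in the text (a lone group 1 ORF named NS3b on the basis of its Pfam family, the traditional NS5a of group 2a, and the sars-prefixed names of group 2b in Table 1) are deliberately not handled.

```python
# (group, upstream gene, downstream gene) -> name prefix, transcribed from
# the rules in the text. Keys and structure are illustrative only.
NS_RULES = {
    ("2a", "orf1ab", "HE"): "NS2",
    ("2a", "S", "E"): "NS4",
    ("3", "M", "N"): "NS5",
}
for _group in ("1", "2c", "2d", "3"):
    NS_RULES[(_group, "S", "E")] = "NS3"


def name_accessory_orfs(group, upstream, downstream, n_orfs):
    """Return names for n_orfs consecutive ORFs lying between two genes.

    Simplified: exceptions described in the text are not modeled.
    """
    if group == "2b":
        raise ValueError("group 2b uses sars-prefixed names (see Table 1)")
    if upstream == "N":
        prefix = "NS7"  # ORFs downstream of the N gene
    else:
        prefix = NS_RULES.get((group, upstream, downstream))
    if prefix is None:
        raise ValueError(f"no naming rule for {group}: {upstream}..{downstream}")
    return [prefix + chr(ord("a") + i) for i in range(n_orfs)]
```

Because the names encode genome position rather than homology, two ORFs that receive the same name in different groups need not be orthologs, as noted above.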


Identical sequence labeling. Sequence redundancy is 
another problem of coronavirus sequences in public 
nucleotide databases. Different strains of the same species 
from samples collected in different locations or at different
times may possess completely or partially identical
sequences. These sequences, though containing important 
epidemiological information, increase the workload 
during sequence analysis. In CoVDB, we compared all 
nucleotide sequences and labeled the identical ones to 
mitigate this problem. Users can choose to show or not to 
show strains with identical sequences by clicking on the 
check boxes to the left of the page (Figure 3b). 
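
Labeling of fully identical sequences can be sketched by grouping records on their sequence string. The code below is a minimal illustration with hypothetical names; it handles only complete identity, not the partially identical case mentioned above, and the ‘Uniq’ label follows the column shown in Figure 3b.

```python
from collections import defaultdict


def label_identical(seqs):
    """Label each gene id as 'Uniq' or with the ids sharing its sequence.

    seqs maps gene id -> nucleotide sequence. Comparison is exact and
    case-insensitive.
    """
    groups = defaultdict(list)
    for gene_id, seq in seqs.items():
        groups[seq.upper()].append(gene_id)
    labels = {}
    for ids in groups.values():
        for gene_id in ids:
            others = sorted(i for i in ids if i != gene_id)
            labels[gene_id] = others if others else "Uniq"
    return labels
```

Storing the ids of the duplicates, rather than deleting them, keeps the epidemiological information while letting users hide redundant strains.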


Blast similarity search. During coronavirus
gene sequence analysis, we encountered a major problem
when coronavirus gene sequences, especially those of
orf1ab, were used for blast search against GenBank or any
other coronavirus database. When part of the orf1ab
gene (e.g. nsp5) is used as the query sequence, instead
of returning the gene for the specific non-structural protein
that the query sequence is homologous to, the results
only show that the hits are within orf1ab or, in some
cases, within the entire coronavirus genome.
Much time is then needed to analyze the
results manually in order to locate the positions of the
cleavage sites of the corresponding genes for the
non-structural proteins, making further downstream work
very inefficient.
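
The manual step described above amounts to an interval lookup: given where a hit falls on the genome, find which annotated nsp regions it overlaps. The sketch below shows that lookup with made-up coordinates; the function name and the annotation format are assumptions for illustration.

```python
def nsps_overlapping(hit_start, hit_end, nsp_annotation):
    """Return names of annotated regions that overlap a hit.

    nsp_annotation is a list of (name, start, end) tuples with 1-based,
    inclusive genome coordinates.
    """
    return [
        name
        for name, start, end in nsp_annotation
        if start <= hit_end and hit_start <= end
    ]


# Hypothetical boundaries, for demonstration only.
DEMO_ANNOTATION = [
    ("nsp4", 8000, 9500),
    ("nsp5", 9501, 10400),
    ("nsp6", 10401, 11300),
]
```

With such boundaries, a hit spanning 9600 to 10000 resolves directly to nsp5, whereas a database annotated only at the orf1ab level could not make that distinction.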

This problem has been overcome by the annotated
sequences in CoVDB. The blast search page of CoVDB
is an interface for facilitating coronavirus similarity
search. The background support program, blastall, is
from the NCBI Blast package. The blast search page
can be entered by clicking ‘Tools’ in the top menu bar on
any page of CoVDB. Since all sequences in CoVDB are
annotated, they can be grouped into different datasets
for blast search. Users can choose one of the three
nucleotide and two protein sequence datasets as the
database for comparison (Figure 4). The three nucleotide
sequence datasets are: CoV genes (nsp + genes after 1ab),
CoV genes (1ab + genes after 1ab) and CoV GenBank
strains, which are the original sequences retrieved from
GenBank. The two protein sequence datasets are the
translated sequences of the first two nucleotide datasets:
CoV proteins (nsp + aa after 1ab) and CoV proteins
(1ab + aa after 1ab).

Figure 3. Screenshots of all gene retrieval pages. (a) Gene sequences are grouped vertically according to which coronavirus group and subgroup they belong to, and horizontally by the name of the genes. The number next to each checkbox indicates the number of sequences of that gene in CoVDB. The option ‘Exclude partial CDS’ can be used if only complete genes are required. (b) Example showing the 15 sequences of nsp13 in group 3 coronaviruses. The first column is the CoVDB gene id. In the Uniq column, ‘Uniq’ will be shown if there is no other identical sequence in CoVDB. Otherwise, the gene ids of the sequences identical to it will be shown.


MyBlast. ‘MyBlast’ employs the same blast program 
as the Blast page mentioned above. However, instead of 
selecting a predefined nucleotide or amino acid sequence 
database, multiple sequences can be pasted into the second 
sequence input box to generate a temporary sequence 
database. One or more query sequences can be pasted 
into the first sequence input box for blastn or blastp search 
against the temporary sequence database. 
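
Before pasted sequences can serve as a temporary database, they have to be split into individual records. The following is a minimal sketch of that first step for multi-FASTA text; the function name is hypothetical and the code does not reproduce how MyBlast itself processes its input.

```python
def parse_fasta(text):
    """Split multi-FASTA text into (header, sequence) tuples.

    Blank lines are skipped and any text before the first '>' is ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The resulting records could then be written to a file and indexed by the BLAST package before the query sequences are compared against them.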


ORF finder for coronavirus. This ORF finder is
specifically designed for coronavirus genome analysis. The result
page shows the position and length of each putative
ORF and the position of the putative ribosomal
frameshift site for translation of orf1ab. The nucleotide
or amino acid sequences of the ORFs can be shown
by selecting the corresponding check boxes. To facilitate
genome comparison and annotation, the most closely
related coronavirus that has been annotated in
CoVDB can be chosen from a pull-down list for
comparison using blast search. This function is
particularly useful for determining the range of each nsp in orf1ab.
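
A bare-bones version of such an ORF scan is sketched below. It is not the CoVDB tool: it searches only the three forward frames for ATG-to-stop ORFs, and it looks for the heptanucleotide TTTAAAC as a candidate frameshift signal, which is an assumption brought in here (the slippery sequence commonly reported for coronaviruses) rather than something stated in the text.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}
SLIPPERY = "TTTAAAC"  # assumed frameshift signal; not given in the source


def find_orfs(genome, min_len=30):
    """Return (start, end, frame) for forward-strand ORFs, 1-based inclusive.

    An ORF runs from the first ATG after the previous stop to the next
    in-frame stop codon; min_len is the minimum length in nucleotides.
    """
    seq = genome.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return sorted(orfs)


def slippery_sites(genome):
    """Return 1-based start positions of every SLIPPERY occurrence."""
    seq = genome.upper().replace("U", "T")
    return [m.start() + 1 for m in re.finditer(f"(?={SLIPPERY})", seq)]
```

A real coronavirus ORF finder would also need to join the two reading frames of orf1ab across the frameshift site, which this sketch leaves out.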


DISCUSSION 


Rapid and accurate batch sequence retrieval is both the
cornerstone and bottleneck for comparative gene or
genome analysis. During the process of complete
genome sequencing and comparative analysis of the
various novel human and animal coronavirus genomes
in the past 2 years, we have developed a comprehensive
database, CoVDB, of annotated coronavirus genes
and genomes, which offers efficient batch sequence
retrieval and analysis. In our experience of
using CoVDB for comparative genome analysis of the
novel coronaviruses we have discovered (4,13,16,18,19),
CoVDB is more rapid and efficient than
other existing coronavirus databases for batch sequence
retrieval for the following reasons. First, we have
annotated all non-structural proteins in
the polyprotein encoded by orf1ab of every single
sequence. Second, the non-structural proteins encoded
by ORFs downstream to orf1ab were annotated
using a standardized system, with exceptions
made for some names that have been in use for
a long time, so as to minimize confusion. Third, all
sequences with identical nucleotide sequences were labeled,
so that one can choose to show or not to show strains
with identical sequences. Fourth, CoVDB contains not
only complete coronavirus genome sequences, but also
incomplete genomes and their genes. Some genes of
coronaviruses, such as pol, spike and nucleocapsid, are
sequenced much more frequently than others because they
are either the most conserved or the least conserved. These gene
sequences are particularly important for evolutionary
analysis, single nucleotide polymorphism studies and
design of primers for RT-PCR or quantitative RT-PCR
amplification.

Figure 3. Continued. (b) Notes shown on the result page: NA, not available; Shift, -1 frameshift position; Type, c for complete CDS and p for partial CDS; Length, a red number means the sequence may contain multiple stop codons due to sequencing error or mutation; Country:Region, if no virus location is available, this indicates the submitters' country and region.

Table 1. Genome organization of different groups of coronavirus

Group   Organization
1       5’UTR-nsp1-16-S-NS3x-E-M-N-(NS7x)-3’UTR
2a      5’UTR-nsp1-16-(NS2a)-HE-S-(NS4x)-NS5a-E-M-N-3’UTR
2b      5’UTR-nsp1-16-S-sars3x-E-M-sars6-sars7x-sars8x-N-3’UTR
2c      5’UTR-nsp1-16-S-NS3x-E-M-N-3’UTR
2d      5’UTR-nsp1-16-S-NS3x-E-M-N-(NS7x)-3’UTR
3       5’UTR-nsp1-16-S-NS3x-E-M-NS5x-N-(NS7x)-3’UTR


Availability 


CoVDB was constructed by the Department of
Microbiology, the University of Hong Kong. It is available
at no charge at http://covdb.microbiology.hku.hk.